*This is the file I went through in the Oct 21, 2018 video I posted, but I changed it a bit after that.  

*GOAL
*Create a file for analyses with
*-the contest as the unit of analysis
*-dependent variable: the percent of the 2-party vote going to the Democrat

*THEME: DEATH BY 1,000 CUTS.  

clear
version 15.1
set varabbrev off, permanently
cd C:\Users\User\Dropbox\04_SLERs\CODE\UOAContest

global datezzz 20181021

*put name of main file to be created here
global mainfile 102slersuoacontest$datezzz
*put the name of the slers file used here
global slersfile 001_196slers1967to2016_20180908

*CHAMBER SEATS
*Create file with the number of seats up in each chamber and the proportion of seats up in each chamber.  
*EXPLANATION: THESE CHAMBER-YEAR VARIABLES WILL BE USED TO 1) ASSESS WHETHER ALL SEATS ARE ACCOUNTED FOR AT THE END AND 2) AID FIGURING OUT WHETHER LAGGED VARIABLES NEED TO BE CHANGED TO SYSTEM MISSING BECAUSE OF REDISTRICTING.
clear
import excel 002From_StatePartisanBalance1777to2016_20171027_SourceFiles.xlsx, firstrow
rename electyear year
drop if year<1967
gen sen=chambercode==8
rename stateno sid
keep year sid sen totinsess
save tempchamberseats, replace
clear
use $slersfile
keep if deter==1&outcome=="w"
gen temp=mod(termz,1)
replace year=year-1 if temp==.5
*EXPLANATION: THE FOLLOWING LINE IS NECESSARY BECAUSE WE WANT TO COUNT SEATS.  BUT IF ONE WINNER HAS MULTIPLE LINES IN ONE CONTEST BECAUSE 1) THEY ARE RUNNING ON MULTIPLE PARTIES OR 2) COUNTY BREAKDOWNS ARE REPORTED, THEY HAVE TO BE REDUCED TO ONE LINE.  
collapse (mean) eseats, by(year sid sen dname dno geopost mmdpost cand)
gen c=1
collapse (mean) eseats (sum) c, by(year sid sen dname dno geopost mmdpost)
replace eseats=2 if sid==45&sen==0&year==1986&dname=="orleans"&dno==3
assert eseats==c
drop c
collapse (sum) seatsup=eseats, by(year sid sen)
merge 1:1 year sid sen using tempchamberseats
drop if _merge==2
gen propup=seatsup/totinsess
drop if propup==.
assert propup<=1
*seatprop never more than 1, good
assert sid==2&sen==1&year==2012 if propup>.9&propup!=1
*That should be the AK 2012 senate with propup=.95.  If it's anything else, there could potentially be a mistake.  
drop _merge
rename totinsess totalseats
save tempchamberseats, replace
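*ILLUSTRATION (toy data, not from SLERs): a minimal sketch of why the two collapses above are needed to count seats.  All values and the variable partyline are made up for the example.  
preserve
clear
input year sid sen dno str10 cand str1 partyline eseats
2000 1 0 1 "smith" "d" 2
2000 1 0 1 "smith" "i" 2
2000 1 0 1 "jones" "r" 2
end
*One winner (smith) appears on two party lines, so counting rows would give 3 winners for a 2-seat contest.  
collapse (mean) eseats, by(year sid sen dno cand)
gen c=1
collapse (mean) eseats (sum) c, by(year sid sen dno)
*After reducing each winner to one line, the count of winners matches the number of seats.  
assert eseats==c
restore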

*SLERs
clear
use $slersfile

*RUNOFFS
*For the three general-election runoffs in the dataset, code the winner of the runoff as the winner of the first round, set deter=1 and eseats for the first round, and set deter=0 for the runoff.  The first round can then be used for vote share, but its winner no longer necessarily corresponds to the highest vote getter.  The ultimate winner in the runoff is preserved so that winners can be aggregated when appropriate.  
*GA 1968 HS
gen temp=sid==10&sen==0&dno==73&mmdpost==2&year==1968
list etype deter cand outcome eseats dtype caseid if temp
replace outcome="w" if caseid==46053
replace outcome="l" if caseid==46052
replace eseats=1 if temp&etype=="gfunset"
replace deter=1 if temp&etype=="gfunset"
replace deter=0 if temp&etype=="grunoff"
*GA 2000 HS
replace temp=sid==10&sen==0&dno==29&year==2000
list etype deter cand outcome eseats dtype caseid if temp
replace outcome="w" if caseid==42769
replace outcome="l" if caseid==42772
replace eseats=1 if temp&etype=="gfunset"
replace deter=1 if temp&etype=="gfunset"
replace deter=0 if temp&etype=="grunoff"
*VT 1986 HS
replace temp=sid==45&sen==0&dname=="orleans"&dno==3&year==1986
list etype deter cand outcome eseats dtype caseid if temp
replace outcome="w" if caseid==240180
replace outcome="l" if caseid==240178
replace eseats=2 if temp&etype=="gfpartunset"
replace deter=1 if temp&etype=="gfpartunset"
replace deter=0 if temp&etype=="grunoff"
drop temp

*CASE SELECTION
*identify cases that are missing important variables as cases not to use.
*dontuse=1 means that the election shouldn't be used for an analysis of the determinants of vote share, although it might be appropriate to use for other purposes, such as tabulating winners for a party by chamber, etc.  
*The following are held at irregular times, but should be kept for lagging vote share.  
gen dontuse=0
foreach string in dno party eseats etype outcome writeinstatus generalproblem identity day vote incompleteelect {
gen temp=regexm(uncert,"`string'")
replace dontuse=1 if temp==1
drop temp
}
*FL 2014 HS
*The outcome of the following general election was thrown out.  But the person who received the most votes in that general election also won the special election that was called later to fill the seat, so the special election can be dropped and the initial general election can be kept and changed to deter=1.  
drop if sid==9&sen==0&dno==64&year==2015
replace deter=1 if sid==9&sen==0&dno==64&year==2014&etype=="g"
*DONTUSE
replace dontuse=1 if etype!="g"&etype!="gs"&etype!="ssg"
*KEEP
keep if deter==1|etype=="g"|etype=="gs"|etype=="ssg"

*drop cases that are in non-partisan election state-years.  
drop if sid==27|(sid==23&year<1974)

*OVERVIEW: HOW NON-MAJOR PARTY CANDIDATES & VOTES ARE DEALT WITH.  
*A variable "bigthird" will track whether contests should be excluded because of a large third party presence.  
*If more than 20% of total votes are for non-major party candidates, the contest is excluded from analysis.  (Non-major party incumbents or other prior legislators also result in exclusion.)
*(An alternative approach would be to model the impact of varying percents of non-major party votes on vote share.)
*However, before computing this amount, we should exclude small write-in candidates from the denominator of the above fraction.
*This should be done because states and years vary greatly in whether write-in / scattering figures are reported, and whether they were collected.  Excluding them makes the figures comparable over time.  
*Write-in candidates who were incorrectly coded as Democrats or Republicans should also be identified and dropped.  

*WRITEIN
*The following has to be done before the collapse.  
gen writein=1 if caseid==142300
replace writein=1 if caseid==142487
egen max=max(writein), by(year sid sen dname dno geopost mmdpost etype)
list year sid sen dname dno geopost mmdpost etype party partyz partyt cand vote if max==1
*I'm not sure why I concluded in the past these were writeins and not filed independents.  Make them writeins.  
replace party="writein" if writein==1
replace partyz="writein" if writein==1
replace partyt="writein" if writein==1

*Verify that partyt doesn't vary within a candidate-election.
bysort year sid sen dname dno geopost mmdpost etype candid: gen sum1=_N
bysort year sid sen dname dno geopost mmdpost etype candid partyt: gen sum2=_N
assert sum1==sum2

*UOA CAND-PARTY
*Collapse data so that county breakdowns are no longer present.  This will make it easier to see if candidates are running on multiple lines.  
gen votemiss=vote==.
collapse (mean) votemiss (sum) vote (max) dontuse firstcase, by(year sid sen dname dno geopost mmdpost specpost cand candid termz outcome exper tenure1 tenure2 deter etype eseats dseats dtype popnum regime redist redist1 redist2 redist3 nest nest1 nest2 nest3 party partyz partyt)

*VOTEMISS
*verify that votemiss is either 0 or 1, and not in between.
assert votemiss==0|votemiss==1
*that is correct
assert vote==0 if votemiss==1
*collapse (sum) returns 0 rather than system missing when every value being summed is missing, so change vote=0 back to system missing in those cases.  
replace vote=. if votemiss==1
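*ILLUSTRATION (toy data): a minimal sketch of the behavior handled above.  collapse (sum) gives 0, not missing, when every value in a group is missing, so the mean of a missing-value flag is carried along and used to restore the missing value afterwards.  The toy variable names are made up.  
preserve
clear
input toyid toyvote
1 .
1 .
2 10
2 5
end
gen toymiss=toyvote==.
collapse (mean) toymiss (sum) toyvote, by(toyid)
*The all-missing group comes out as 0 until it is reset.  
assert toyvote==0 if toyid==1
replace toyvote=. if toymiss==1
assert toyvote==. if toyid==1
assert toyvote==15 if toyid==2
restore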

*WRITEINS
*Get rid of scattering and unnamed write-in lines regardless of vote amount.  Write-ins identified by name who receive less than 5% of the vote are dropped further below.  
drop if cand=="scattering"|cand=="writein"

*NONMAJ
*Drop non-major party candidates (including writeins) if all such candidates in one election together received less than 5% of the total vote (as adjusted for the number of seats below).  Doing this won't influence how the variable "dontuse" is coded; it will merely make it faster to assess who is a write-in and who isn't.  
bysort year sid sen dname dno geopost mmdpost etype candid: gen rows=_N
gen tempvote=vote if partyt!="d"&partyt!="r"
egen sum1=sum(tempvote), by(year sid sen dname dno geopost mmdpost etype)
egen sum2=sum(vote), by(year sid sen dname dno geopost mmdpost etype)
*Contests with more than 1 seat have the percentage adjusted upwards by multiplying it by eseats.  That means that in (say) a three seat contest, non-major party candidates will only be dropped if they collectively have less than roughly 1.67% of the total vote.  This is because, per percentage point of the total, they are more likely to influence whether specific Democratic or Republican candidates win when there are more seats to win.  Note that sum2 already includes the non-major votes in sum1, so the denominator below counts them twice and the computed percent slightly understates the share.  
gen nonmajvote=((sum1*eseats)/(sum1+sum2))*100
drop if partyt!="d"&partyt!="r"&nonmajvote<5&rows==1
drop sum1 sum2 nonmajvote
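*ILLUSTRATION (toy numbers): a worked example of the seat adjustment described above, using the plain share of the total vote multiplied by the number of seats.  The toy variable names are made up.  
preserve
clear
input toyseats toynonmaj toytotal
1 40 1000
3 40 1000
end
gen toyadj=((toynonmaj*toyseats)/toytotal)*100
*4% of the vote counts as 4 in a one-seat contest but as 12 in a three-seat contest, so only the first falls under a cutoff of 5.  
list
assert toyadj<5 if toyseats==1
assert toyadj>5 if toyseats==3
restore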

*WRITEINS
*Clear up problems associated with probable writeins who weren't coded as such in SLERs (I will deal with these better in the future).  First I identify those getting fewer than 10 votes, then those getting between 10 and 19 votes, just to see how many there are of each.  Not all candidates with those codes for v20 are write-ins; the number of votes they get is part of the evidence that they're a write-in. 
*Are there any writeins who are a separate line for a fused candidate, and so the writein status in question can be ignored?
tab partyz partyt
*Yes, when partyz=writein, partyt=d 20 times, nonmaj 1 time, and r 24 times.  But make sure those are separate lines within one election.
tab partyt if rows!=1
*There are 13 writeins, that is potentially a problem.  
gen temp=partyt=="writein"&rows!=1
egen max=max(temp), by(year sid sen dname dno geopost mmdpost etype candid)
list year sid sen dname dno geopost mmdpost etype cand candid party partyz partyt vote if max==1
*Those are legit, except for perhaps one (writein dem who received 7496).  One other was a writein repub who received 312, but that's not much.  Why does the number of votes matter?  These aren't problematic at all, they're all write-ins, even if there are multiple lines.  
drop temp max
*How many republicanwritein and democraticwritein cands are there?
tab vote party if party=="republicanwritein"|party=="democraticwritein"
*17 repub, 23 dem, and some get very large vote numbers, 35k and 23k being the two highest.  
list year sid sen dno party vote if party=="republicanwritein"|party=="democraticwritein"
*Re-verify that partyt doesn't vary within a candidate-election.
bysort year sid sen dname dno geopost mmdpost etype candid: gen sum1=_N
bysort year sid sen dname dno geopost mmdpost etype candid partyt: gen sum2=_N
assert sum1==sum2
*No problem.  
drop sum1 sum2
*Those are all d or r.  With the above established, partyz can be ignored.  
gen writein=partyt=="writein"
replace writein=1 if party=="99993"&vote<10&rows==1
replace writein=1 if party=="99994"&vote<10&rows==1
replace writein=1 if party=="99997"&vote<10&rows==1
replace writein=1 if party=="99998"&vote<10&rows==1
replace writein=1 if party=="99993"&vote<20&rows==1
replace writein=1 if party=="99994"&vote<20&rows==1
replace writein=1 if party=="99997"&vote<20&rows==1
replace writein=1 if party=="99998"&vote<20&rows==1
egen sum1=sum(vote), by(year sid sen dname dno geopost mmdpost etype)
egen sum2=sum(vote), by(year sid sen dname dno geopost mmdpost etype candid)
*Adjust the amount upwards if there is more than one seat in the contest.  
gen tempvoteper=((sum2*eseats)/sum1)*100
drop sum1 sum2
replace writein=1 if party=="99993"&tempvoteper<2&rows==1
replace writein=1 if party=="99994"&tempvoteper<2&rows==1
replace writein=1 if party=="99997"&tempvoteper<2&rows==1
replace writein=1 if party=="99998"&tempvoteper<2&rows==1
replace writein=1 if tempvoteper<1&vote<50
replace partyt="writein" if writein==1&rows==1
*Re-verify that partyt doesn't vary within a candidate-election.
bysort year sid sen dname dno geopost mmdpost etype candid: gen sum1=_N
bysort year sid sen dname dno geopost mmdpost etype candid partyt: gen sum2=_N
assert sum1==sum2
*No problem.  
drop sum1 sum2

*The following drops write-in candidates if the same individual is running in another district contemporaneously.  People often write in a candidate's name in a neighboring district, especially after redistricting, when they write in an incumbent they would like to still be able to vote for.  
egen min=min(writein), by(year candid)
tab year sid if writein==1&min==0&rows==1
*There are only 11, 8 in RI 2002.  This is probably a mistake in the returns, which I'd guess I already looked into.  How many votes do they have?
tab vote if writein==1&min==0&rows==1
*Most are extremely small amounts, but there are two that have more than 500, and two with between 200 and 500.  
gen temp=1 if writein==1&min==0&rows==1&vote>200&vote!=.
egen max=max(temp), by(year sid sen dname dno geopost mmdpost etype)
*dname, geopost or mmdpost are never obs in the following.  
list year sid sen dno etype cand partyt vote temp if max==1
*compare them to themselves.  Do they have the same number of votes?
gen temp2=1 if writein==1&min==0&rows==1
egen max2=max(temp2), by(year candid)
sort candid year sen dno
list year sid sen dno etype cand partyt vote temp2 if max2==1
*Many of those are clearly writeins.  I'm just going to drop them.  
drop if writein==1&min==0&rows==1
*11 obs dropped, good.
drop rows tempvote writein tempvoteper min temp max temp2 max2
*The following drops write-in candidates who received less than 5% of the total vote (as recomputed).  The assumption here is that they are inconsistently reported across states, so if they are kept, states that are less apt to report writeins will have contests that are less likely to be coded as having a large third party.  To equalize exclusion rules across states, write-ins that receive very few votes are excluded from all states.  
egen sum1=sum(vote), by(year sid sen dname dno geopost mmdpost etype)
egen sum2=sum(vote), by(year sid sen dname dno geopost mmdpost etype candid)
*Adjust the amount upwards if there is more than one seat in the contest.  
gen tempvoteper=((sum2*eseats)/sum1)*100
drop sum1 sum2
drop if partyt=="writein"&tempvoteper<5
*That deleted 244 more cases
tab partyt
*196 writeins left
tab tempvoteper if partyt=="writein"
*about 5% are winners.  
*Are any incumbents?  They may not actually be running.  
tab tempvoteper exper if partyt=="writein"
gen quality=exper!="none"
tab quality if partyt=="writein"
logit quality tempvoteper if partyt=="writein"
*80 out of 196 are incumbents or prior office holders, and the writein in question is more likely to be a prior office holder if they received a larger percent of the vote they obtained in the election.  Nothing definitive can be inferred about which of these are "real" candidates from this.  
drop tempvoteper quality
*I judged the following person wasn't actually on the ballot, although they were reported as having 0 votes in the returns.  
drop if year==2002&sid==8&sen==0&dno==9&partyz=="d"
*They're gone anyway, no obs dropped.  

*VOTEMISS
replace votemiss=vote==.

*FIRSTCASE
recode firstcase (1/2=1)
replace firstcase=1 if sid==23&(year==1974|(year==1976&sen==1&dno!=47&dno!=64))

*UOA CAND
collapse (mean) votemiss (sum) vote (max) dontuse firstcase, by(year sid sen dname dno geopost mmdpost specpost cand candid termz outcome exper tenure1 tenure2 deter etype eseats dseats dtype popnum regime redist redist1 redist2 redist3 nest nest1 nest2 nest3 partyt)

*VOTEMISS
*verify that votemiss is either 0 or 1, and not in between.
assert votemiss==0|votemiss==1
*that is correct
*collapse (sum) returns 0 rather than system missing when every value being summed is missing, so change vote=0 back to system missing in those cases.  
replace vote=. if votemiss==1

*IDENTIFICATION
*verify that each candidate is only observed once per election.  
bysort year sid sen dname dno geopost mmdpost etype cand candid: assert _N==1
*all 1, good

*PARTYWEIRD
*Create a variable tracking party switches since the last time they ran, as long as it was within four years, and from what party to what.  
*Don't consider it a party switch if they were a write-in or non-major party candidate in the past.  I'm going to assume that if someone runs as a dem or repub after being a third party candidate, that won't affect the vote they obtain.
tab vote if partyt=="writein"
sum vote if partyt=="writein"
*most of those obtained large numbers of votes, and the ave is about 2.6k.  
*The following is because it doesn't matter if someone went from nonmaj to partymiss.  If they go from dem to partymiss, that would still register.  
gen partyt2=partyt
replace partyt2="nonmaj" if partyt=="nonpart"|partyt=="partymiss"
bysort candid (year): gen id=1 if _n==1
replace id=sum(id)
bysort id (year): gen row=_n
tsset id row
by id: gen lag=partyt2[_n-1]
by id: gen yearlag=year[_n-1]
gen yeardif=year-yearlag
*"switch" tracks whether there has been a d to r or r to d switch.
gen dswitch=partyt2=="d"&lag=="r"&yeardif<5
gen rswitch=partyt2=="r"&lag=="d"&yeardif<5
*"stealth" tracks whether a non-major party cand was a d or r in the recent past, even if they're a writein (only writeins with a lot of votes are left in at this point).  
gen dstealth=partyt2=="nonmaj"&lag=="d"&yeardif<5
gen rstealth=partyt2=="nonmaj"&lag=="r"&yeardif<5
by id: gen outcomelag=outcome[_n-1]
gen dswitchwin=dswitch==1&outcomelag=="w"
gen rswitchwin=rswitch==1&outcomelag=="w"
gen dstealthwin=dstealth==1&outcomelag=="w"
gen rstealthwin=rstealth==1&outcomelag=="w"
drop partyt2 id row lag yearlag yeardif outcomelag
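*ILLUSTRATION (toy data): a minimal sketch of the lag logic above for one hypothetical candidate who ran as r in 2000, as d in 2002, and as r again in 2010.  The toy variable names are made up.  
preserve
clear
input toycandid year str1 toyparty
7 2000 "r"
7 2002 "d"
7 2010 "r"
end
bysort toycandid (year): gen toylag=toyparty[_n-1]
bysort toycandid (year): gen toyyeardif=year-year[_n-1]
gen toydswitch=toyparty=="d"&toylag=="r"&toyyeardif<5
gen toyrswitch=toyparty=="r"&toylag=="d"&toyyeardif<5
*Only the 2002 race counts as a switch.  The 2010 race is 8 years after the previous one, so it does not.  
assert toydswitch==1 if year==2002
assert toydswitch==0 if year!=2002
assert toyrswitch==0
restore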
*DESCRIPTIVES
*% of contests
*party switchers: .58
*party switchers who won last time: .45
*stealth partisans: .39
*stealth partisans who won last time: .21
*Total: .98% of all contests have some kind of party weirdness, and .66 have some kind of party weirdness with a winner from the past.  
*CUMULATIVE CUTS
gen cumulativecuts=0
foreach string in dswitch rswitch dstealth rstealth dswitchwin rswitchwin dstealthwin rstealthwin {
replace cumulativecuts=1 if `string'==1
}

*KEYVARS
*create variables tracking cands, wins, incs, inc2s, inc3s, votes and prior legislative experience.
rename cand candname
gen cand=1
gen win=outcome=="w"
gen inc=exper=="inc"
recode tenure1 (0/3=0) (4/7=1) (8/max=0), gen(inc2)
recode tenure1 (0/7=0) (8/max=1), gen(inc3)
recode tenure2 (0/3=0) (4/7=1) (8/max=0), gen(leg2)
recode tenure2 (0/7=0) (8/max=1), gen(leg3)
gen other=exper=="other"
gen past=exper=="pastinc"|exper=="pastother"|exper=="pastboth"
*pull out # cands, votes and wins by party
foreach string in cand vote win inc inc2 inc3 other leg2 leg3 past {
gen d`string'=`string' if partyt=="d"
gen r`string'=`string' if partyt=="r"
gen o`string'=`string' if partyt!="d"&partyt!="r"
}
drop cand exper inc inc2 inc3 other leg2 leg3 past

*TERMZ
replace termz=1 if termz==1.5
*create weights for termz
bysort year sid sen dname dno geopost mmdpost etype: gen sum1=_N
bysort year sid sen dname dno geopost mmdpost etype termz: gen sum2=_N
gen dif=sum1-sum2
egen max=max(termz), by(year sid sen dname dno geopost mmdpost etype)
*the following is non-zero for just a few elections
gen minweight=win==1&termz!=max
*the following is non-zero for all elections.  
gen maxweight=win==1&(dif==0|(termz==max))
drop sum1 sum2 dif win max

*VOTEMISS
*The following non-maj cand is sysmis for vote, but can be dropped without hurting anything.  There should be one obs dropped in the following.
drop if year==2010&sid==44&sen==0&dno==52&partyt=="nonmaj"&vote==.
egen mean=mean(votemiss), by(year sid sen dname dno geopost mmdpost)
list year sid sen dname dno geopost mmdpost partyt votemiss vote outcome dontuse if mean!=0&mean!=1
*There are four cases from two such elections.  They should be dontuse=1, because both have two cands, one d and one r, and one cand in each election is sysmis for vote.  Taking the max of votemiss will accomplish this when it is done later.  
sum sid if mean!=0&mean!=1
local aaa=r(N)
assert `aaa'==4
drop mean
save $mainfile, replace



*MAINFILE
clear
use $mainfile

*UOA CONTEST
collapse (sum) dvote rvote ovote dcand rcand ocand dwin rwin owin dinc rinc oinc dinc2 rinc2 oinc2 dinc3 rinc3 oinc3 dother rother oother dleg2 rleg2 oleg2 dleg3 rleg3 oleg3 dpast rpast opast dswitch dswitchwin rswitch rswitchwin dstealth rstealth dstealthwin rstealthwin maxweight minweight (max) votemiss dontuse maxtermz=termz firstcase cumulativecuts (min) mintermz=termz, by(year sid sen dname dno geopost mmdpost specpost deter etype eseats dseats dtype popnum regime redist redist1 redist2 redist3 nest nest1 nest2 nest3)

*VOTEMISS
*collapse (sum) returns 0 rather than system missing when every value being summed is missing, so change the vote totals back to system missing in those cases.  
recode dvote rvote ovote (*=.) if votemiss==1
*MS HS 1975 #28, GEOPOST=2, MMDPOST=2 is in there twice, once as deter=0&etype=g, and once as deter=1&etype=s.  This was apparently an unresolved general election.  eseats=1.  
list etype dvote rvote ovote dcand rcand ocand dwin rwin owin if year==1975&sid==24&sen==0&dno==28&geopost==2&mmdpost==2
*For simplicity, just drop the special election.  redist also=0 for the general, which shouldn't be the case.  
replace redist=1 if year==1975&sid==24&sen==0&dno==28&geopost==2&mmdpost==2
drop if year==1975&sid==24&sen==0&dno==28&geopost==2&mmdpost==2&etype=="s"
*one obs dropped from each, good

*IDENTIFICATION
bysort year sid sen dname dno geopost mmdpost: assert _N==1

*PROPS
*replace vars with proportions
*The next four are just to make the code work.
gen oswitch=0
gen ostealth=0
gen oswitchwin=0
gen ostealthwin=0
foreach string in cand win inc inc2 inc3 other leg2 leg3 past switch switchwin stealth stealthwin {
foreach party in d r o {
*adjust number of cands when election is over-contested.  This will get top two primary state-years and NV, but that's part of the plan, they are uncontested elections.  
replace `party'`string'=eseats if `party'`string'>eseats&`party'`string'!=.
replace `party'`string'=`party'`string'/eseats
}
}
drop oswitch ostealth oswitchwin ostealthwin
*For one party, inc+other+past can't be more than 1.  Assume incumbents will beat other, who will beat past.  inc has already been capped at 1 above.  If inc+other>1, reduce other by the amount by which that sum exceeds 1.  Then, if inc+other(new)+past>1, reduce past by the amount by which that sum exceeds 1.  
foreach party in d r {
gen temp=`party'inc+`party'other-1
replace temp=0 if temp<0
replace `party'other=`party'other-temp
replace `party'other=0 if `party'other<0
replace temp=`party'inc+`party'other+`party'past-1
replace temp=0 if temp<0
replace `party'past=`party'past-temp
replace `party'past=0 if `party'past<0
drop temp
}
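*ILLUSTRATION (toy numbers): a worked example of the adjustment above for one hypothetical party with inc=1, other=.5 and past=.5 in a contest.  The toy variable names are made up.  
preserve
clear
input toyinc toyother toypast
1 .5 .5
end
gen toytemp=toyinc+toyother-1
replace toytemp=0 if toytemp<0
replace toyother=toyother-toytemp
replace toytemp=toyinc+toyother+toypast-1
replace toytemp=0 if toytemp<0
replace toypast=toypast-toytemp
*The incumbent takes the whole share, so other and past are both reduced to 0.  
assert toyother==0&toypast==0
restore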

*UNCONT
*mixeduncont
*These are contests in which the total number of major party candidates (dems + repubs) equals the number of seats to be won and so the party of the winners is known in advance.  
gen mixeduncont=((dcand+rcand)==1)&dcand!=1&rcand!=1
*partuncont
gen partuncont=(((dcand>.01)&(dcand<.99))|((rcand>.01)&(rcand<.99)))&dcand!=0&rcand!=0&mixeduncont!=1
assert eseats>1 if partuncont==1
*uncont
gen uncont=rcand==0|dcand==0|mixeduncont==1
assert uncont==0 if partuncont==1
assert partuncont==0 if uncont==1
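*ILLUSTRATION (toy data): a minimal sketch of how the three flags above classify four hypothetical contests, where toyd and toyr stand in for dcand and rcand (candidates per seat).  
preserve
clear
input toyd toyr
.5 .5
.5 1
1 0
1 1
end
gen toymixed=((toyd+toyr)==1)&toyd!=1&toyr!=1
gen toypart=(((toyd>.01)&(toyd<.99))|((toyr>.01)&(toyr<.99)))&toyd!=0&toyr!=0&toymixed!=1
gen toyuncont=toyr==0|toyd==0|toymixed==1
list
*Row 1: one dem and one repub for two seats, so mixed uncontested.  
assert toymixed==1&toypart==0&toyuncont==1 in 1
*Row 2: dems field candidates for only half the seats, so partially uncontested.  
assert toymixed==0&toypart==1&toyuncont==0 in 2
*Row 3: no repub at all, so uncontested.  Row 4: fully contested.  
assert toymixed==0&toypart==0&toyuncont==1 in 3
assert toymixed==0&toypart==0&toyuncont==0 in 4
restore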
*DESCRIPTIVES
*mixeduncont: .16%
*partuncont: 2.04%
*CUMULATIVE CUTS
replace cumulativecuts=cumulativecuts+1 if mixeduncont==1|partuncont==1
tab cumulativecuts
*CUMULATIVE CUTS: 3.10%

*VOTESHARE
gen dper=(dvote/(dvote+rvote))*100
replace dper=100 if dwin==1&dcand==1&rcand==0&dontuse==0
replace dper=0 if rwin==1&dcand==0&rcand==1&dontuse==0
*If dper is system missing and the election is fully or partially contested by the major parties, make dontuse=1.  
replace dontuse=1 if dper==.&uncont==0
*0 changes with the above.  
*make sure that there is no repub cand when dper=100 and no dem cand when dper=0
assert dcand==0 if dper==0
*always 0, good
assert rcand==0 if dper==100
*always 0, good
sum sid if dontuse==0&votemiss==1&dper!=100&dper!=0&uncont==0
local aaa=r(N)
assert `aaa'==0
*always 0, good

*BIGTHIRD
*Exclude cases with a strong non-major party presence.  
*Scores of "1" indicate a large proportion of third party votes.
*Scores of "2" track elections with 1) no dem & no repub cands, 2) at least one nonmaj winner, 3) at least one nonmaj inc, 4) at least one nonmaj legislator from the other chamber, or 5) at least one nonmaj legislator who served in the past.  
gen bigthird=0
gen oper=(ovote*eseats)/(dvote+rvote+ovote)
replace bigthird=1 if oper>.2&oper!=.&dcand==1&rcand==1
drop oper
replace bigthird=2 if (dcand==0&rcand==0)|(owin!=0&owin!=.)|(oinc!=0&oinc!=.)|(oother!=0&oother!=.)|(opast!=0&opast!=.)
*CUMULATIVE CUTS
replace cumulativecuts=cumulativecuts+1 if bigthird==1|bigthird==2
tab cumulativecuts
*CUMULATIVE CUTS: 3.72%

*D MINUS R
*Compute as dif of dem and repub
*Change to system missing when not to be used.
foreach string in cand inc inc2 inc3 other leg2 leg3 past switch switchwin stealth stealthwin {
gen `string'=d`string'-r`string'
}

*Merge file that reports the proportion of seats up in each legislative chamber here.  
merge m:1 year sid sen using tempchamberseats
erase tempchamberseats.dta
*merge=1 only in the special elections in 2015 and 2017, if anything else is observed, there is a problem.  
drop if _merge==2
assert _merge==3 if year!=2015&year!=2017
drop _merge

save $mainfile, replace

*OVERVIEW: LAGGING VARIABLES
*The general approach is to create a separate file for the lagged variables, and then replace year with year+termz, then merge these back into the main file.  
*After the merge, lagged variables are changed to system missing if there has been redistricting.  
*This is done (as opposed to the typical way of lagging variables) to solve the following problems.
*1) It allows alternating seats to "leap-frog" over each other.
*Alternating seats are in .59% of contests.  
*CUMULATIVE CUTS
replace cumulativecuts=cumulativecuts+1 if dtype>3&dtype<7
tab cumulativecuts
*CUMULATIVE CUTS: 4.34%
*2) It allows districts that have changed their designations, but not their boundaries, to have lagged values.  .75% of contests.  
*CUMULATIVE CUTS
replace cumulativecuts=cumulativecuts+1 if redist==2|redist==7
tab cumulativecuts
*CUMULATIVE CUTS: 4.99%
*3) (Very minor) It allows incumbents who switch their posts in post-MMDs to have their lagged values associated with the other post.  
*CUMULATIVE CUTS
destring specpost, force gen(temp)
replace cumulativecuts=cumulativecuts+1 if temp!=.
tab cumulativecuts
*CUMULATIVE CUTS: 5.02%
*4) (Very minor) It allows FFA-MMDs that had some of their seats up two years ago, and some of their seats up four years ago, receive values from both of those contests to make its lagged values.  
*CUMULATIVE CUTS: I'm not figuring this one out
*5) (Very minor) Deals with unique problem for ID 1976
*CUMULATIVE CUTS
replace cumulativecuts=cumulativecuts+1 if sid==12&year==1976
tab cumulativecuts
*CUMULATIVE CUTS: 5.09%
*6) It allows alternative lagged variables to be more easily utilized.
*Alternative lagged variables are lagged values from 1) the other chamber when there is nesting or identical seats, 2) another post when there are post-mmds, or 3) lagged values two years ago when there are alternating seats in a district.  
*7) Convenient way to generate cases out into the future for forecasting: generates a list of seats with their lagged values for (say) the 2018 election before it happens.  
*8) It allows "lag aid" variables to be used to prevent many variables for many cases from being turned to system missing if there is redistricting for chambers with four year staggered terms.  
*% of cases this would affect: would vary
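*ILLUSTRATION (toy data): a minimal sketch of the lag-by-merge approach outlined above, with one hypothetical district that has two-year terms.  The names toydist, toydper and toydperlag are made up.  
preserve
clear
input toydist year termz toydper
1 2000 2 55
1 2002 2 60
end
tempfile toymain toylag
save `toymain'
*In the lag file, move each contest forward to the year of the next regular election for the seat.  
replace year=year+termz
rename toydper toydperlag
keep toydist year toydperlag
save `toylag'
use `toymain', clear
merge 1:1 toydist year using `toylag'
*Rows with _merge==2 are future elections (here 2004) that already carry their lagged value.  
assert toydperlag==55 if year==2002
assert toydperlag==. if year==2000
assert _merge==2 if year==2004
restore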

*LAGAIDVARS
*Even with redistricting, a lagged variable doesn't need to be made sysmis for a case if there is no variation in that variable among the entire area that potentially could be in the contemporaneous district the case represents.  
*Examples of potential variables.  
*State or national level variables
*other
*past
*With no additional geographic information beyond what state legislative district the locale is, this would mean that one could only know this if 1) all seats in a chamber were up last time, and there is no variation among those seats, or 2) not all seats were up last time, but between the second to last and last elections before redistricting there was no variation in the variable in question.  Don't do this if dontuselag is ever equal to "1" in the chamber in the period in question.  
*Q: Say there is a senate with four year staggered terms.  Why is it necessary to see if there is no variation in BOTH 1988 and 1990, if the election in question is in 1994?  1988 and 1994 are six years apart?  
*A: The reason is because of deferred voters.  A group of voters in one locale may vote for a state senator in 1988 and not get to vote for a state senator again until 1994.  
*Q: And why do we need 1990 for the election up in 1992?
*A: Because of accelerated voters.  Just because I voted for a state senator for a four year term in 1990 doesn't mean I don't get to vote for another state senator for a four year term in 1992.  
*Most of the lagged variables will have variation, but each variable is tested for this property anyway to take advantage of the few times there isn't for the variables that show the most variation.  
*Preliminary work: create two files to do this.
*file #1 indicates whether there is any variation in the variables that will be lagged in chambers where all the seats are up.  
*file #2 indicates whether there is any variation in the variables that will be lagged among chambers that aren't all up in two election years in a row, but only if no redistricting occurred in the second election year in question.  These will then be matched with elections two and four years in the future (but only if no non-holdover redistricting occurred in the second year).  If a seat is only up twice in the earlier time period range, only keep the second election.  Moving bands four years wide will be used.  (The problem with using three year time ranges is the KY Senate elections when there were briefly five year terms.)
*GLOBAL
*The following are the variables that will be in the lagaid files.  
global lagaidvars dper cand inc inc2 inc3 other leg2 leg3 past switch switchwin stealth stealthwin partuncont mixeduncont uncont dwin bigthird dontuse
*PROGRAM
capture program drop computelagaidvars
program define computelagaidvars
*Collapse to uoa=chamber-year, and compute relevant vars.  
foreach string2 in novar mean miss {
*DEPTH=1
local lagaid`string2' ""
foreach string of global lagaidvars {
*DEPTH=2
gen `string'`string2'=`string'
local lagaid`string2' `lagaid`string2'' `string'`string2'
}
*DEPTH=1
}
foreach string of local lagaidmiss {
*DEPTH=2
replace `string'=`string'==.
}
*DEPTH=1
collapse (sum) eseats (sd) `lagaidnovar' (mean) `lagaidmean' totalseats (max) `lagaidmiss', by(yearlagaid sid sen)
foreach string of global lagaidvars {
*DEPTH=2
replace `string'novar=`string'novar==0
replace `string'novar=0 if `string'miss==1
drop `string'miss
}
*DEPTH=1
end

*FILE #1
*Get SD of variable values within a year group.  Also get the mean of all those when there is no variation and pair them up.  Track missing values of anything, and make novar1 variables "0" if there are any missing cases for a variable.
clear
use $mainfile
rename year yearlagaid
egen max=max(dontuse), by(yearlagaid sid sen)
keep if propup==1|max==0
*Program
computelagaidvars
drop if eseats!=totalseats
drop eseats totalseats
save templagaid1, replace

*File #2
*Four smaller files will be made to construct this file.  One will represent presidential years, the second prez year+1, the third prez year+2 and the fourth prez year+3.
clear
use $mainfile
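*Drop every observation ("l" = last) so that an empty shell is left for the loop below to append to.  
drop in 1/l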
save templagaid2, replace
forvalues aaa=0/3 {
clear
use $mainfile
gen quad=mod(year+`aaa',4)
gen yearnum=quad if quad>0
replace yearnum=4 if quad==0
drop quad
gen yearlagaid=year+(4-yearnum)
di "`aaa'"
tab year yearnum
append using templagaid2
save templagaid2, replace
}
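*CHECK (a hedged sanity check, not part of the original logic; assumes year is always an integer): by construction yearnum runs from 1 to 4, so each contest should be assigned to a band that ends zero to three years after the contest was held.  
assert inrange(yearlagaid-year,0,3) if year<.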
*drop earlier years within yearlagaid group, if two or more election years appear.  
egen max=max(year), by(sid sen dname dno geopost mmdpost yearlagaid)
*Dropping redist=1 cases is okay because it's only when all seats in a chamber are accounted for, and the two election years necessary to account for them all aren't divided by redistricting, that any of them are used.  
*However, don't drop a district if it was redistricted within the first two years of a four year group.  Say that the group includes 1987 to 1990.  A district was up in 1988 and was redist=1.  The district wasn't up in 1990.  Then there was redistricting again in 1992.  The way it is now, 1988 wouldn't be used, and it should be used.  
drop if (year!=max)|(redist!=0&yearnum>2)
drop max
egen max=max(dontuse), by(yearlagaid sid sen)
drop if max==1
*get rid of WV Sen cases, these will be dealt with below.  
drop if sid==48&sen==1
*Program
computelagaidvars
drop if eseats!=totalseats
drop eseats totalseats
save templagaid2, replace

*File #3: WVSEN
*Add the WV Senate, except that instead of making sure that all seats were up in the past, the entire state only needs to be covered with mmdpost=1 or mmdpost=2.  If there are etype=gs elections, those can be two years prior to the contemporaneous election instead of four years prior.  
*The WV Senate is unique in that other state senates with four year term alternating seats do not cover the entire state.
clear
use $mainfile
keep if sid==48&sen==1
gen quad=mod(year,4)
*mmdpost=1 (& 3) are generally up in presidential years while mmdpost=2 (& 4) are generally up in midterm years.  
gen yearlagaid=year if quad==0&(mmdpost==1|mmdpost==3)
replace yearlagaid=year-2 if quad==2&(mmdpost==1|mmdpost==3)&yearlagaid==.
replace yearlagaid=year if quad==2&(mmdpost==2|mmdpost==4)&yearlagaid==.
replace yearlagaid=year-2 if quad==0&(mmdpost==2|mmdpost==4)&yearlagaid==.
*drop earlier years within yearlagaid group, if two or more election years appear.  
egen max=max(year), by(dno mmdpost yearlagaid)
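*As in file #2, drop redistricted contests only when they fall in the later election year of the group, which is quad==2 for mmdpost=1/3 and quad==0 for mmdpost=2/4.  
drop if (year!=max)|(redist!=0&quad==2&(mmdpost==1|mmdpost==3))|(redist!=0&quad==0&(mmdpost==2|mmdpost==4))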
drop max
egen max=max(dontuse), by(yearlagaid sid sen)
drop if max==1
*Program
computelagaidvars
*17 seats have to be accounted for, as there are 34 seats in the WV state senate.  
drop if eseats!=17
drop eseats totalseats
save templagaid3, replace




*LAGVARs
*Create main file with lagged vars to be merged into the main file.  
*SPLIT
*split lagged file into two files, one with the max weights, one with the min weights.  
*inc2 and inc3 aren't lagged, since they are almost perfectly collinear with inc and inclag when both of the latter are included in a model.  
clear
use $mainfile
*The parts of specpost that are alpha shouldn't be put into mmdpost in the following.  
destring specpost, force gen(temp)
replace mmdpost=temp if temp!=.
drop temp
keep year sid sen dname dno geopost mmdpost maxweight minweight maxtermz mintermz dper dvote rvote ovote cand inc inc2 inc3 other leg2 leg3 past switch switchwin stealth stealthwin bigthird dontuse propup partuncont mixeduncont uncont redist dwin
save tempslerslagged, replace
*LAGFILE1
*Create temp lag file #1, with maxtermz
clear
use tempslerslagged
drop minweight mintermz
gen weight=maxweight
rename maxtermz termz
*Break the ID 1974 house off and give it posts: mmdpost=1 here, and mmdpost=2 in the copy that is appended below.  
replace mmdpost=1 if sid==12&sen==0&year==1974
save temp, replace
keep if sid==12&sen==0&year==1974
replace mmdpost=2
append using temp
save temp, replace
*LAGFILE2
*Create temp lag file #2, with mintermz
*The ID 1974 house doesn't have to be adjusted in this one.  
clear
use tempslerslagged
drop maxweight maxtermz
gen weight=minweight
rename mintermz termz
drop if weight==0
*APPEND
*Put the two lag files together
append using temp
erase temp.dta
*YEAR
rename year yearlag
gen year=yearlag+termz
*In one situation, a district is up for an election in a year ending in "0", and has a four year term.  Two years later, in a year ending in "2", which is a redistricting year, a district with the same number is up for election, and has a two year term.  One inappropriate modeling strategy would result in the district up in the year ending in "0" contributing to the lagged value of the election up in the year ending in "4."  So when more than one election is nested in an election that is going to have a lagged value, and the later one of those elections (in this case, the one taking place in a year ending in "2") has redist!=0, drop the lagged values of the earlier election.  
*Another situation is similar to the above, but is more problematic.  In this example, a district is up for an election in a year ending in "0" and has a four year term.  Two years later, the map is redrawn.  There is no election for a district with that name/number in a year ending in "2."  Two years after that, there is re-redistricting, and the district in question has a value of redist=2.  I don't see the problem.  If redist=1, then no lagged value will arrive there.  If redist=2, it should be in terms of the map that was in place the last time an election in that locale was conducted, which, by the definition of this example, it wasn't in a year ending in "2."  What if the term for the prior district in question was only two years, but it was four years ago when it was up?  Is it possible for a situation to be that unfair/messed up?  Anything is possible.  
bysort year sid sen dname dno geopost mmdpost (yearlag): gen row=_n
tab row
*There are either values of 1 or 2.  There are 39 cases of row=2.  
gen temp=row==2&redist==1
tab temp
*there are 30 cases of temp=1
tab sid sen if temp==1
egen redistproblem=max(temp), by(year sid sen dname dno geopost mmdpost)
tab year sid if redistproblem==1
*The only way this will be messed up is if there's re-redistricting in a year ending in 4, as there was in AK in 2014.  
*states from above
*except for half the cases in nd, all the below are from state senates.  
*ak 76 86 14
*co 84
*hi 84 94
*ia 72 84 94 04 14
*mt 96
*nd 94 04
*or 04
*ut 94
drop if row==1&redistproblem==1
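*CHECK (hedged, non-fatal; assumes, as the tab of row above indicates, that no lagged district-year has more than two source elections): after the drop, no district-year flagged redistproblem=1 should still have two source rows.  A nonzero count would mean that assumption does not hold.  tempn is a made-up scratch variable.  
bysort year sid sen dname dno geopost mmdpost: gen tempn=_N
count if tempn>1&redistproblem==1
drop tempn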
*COLLAPSE
*collapse is necessary as some district-years are now observed twice.  
gen c=1
*VOTEMISS
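*collapse ignores missing values, so a (sum) outputs "0" and a (mean) uses only the nonmissing cases when the result should be sysmis.  Flag missing votes here so that can be fixed after the collapse.  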
gen dvotemiss=dvote==.
gen rvotemiss=rvote==.
gen ovotemiss=ovote==.
collapse (sum) c (mean) dper dvote rvote ovote cand inc inc2 inc3 other leg2 leg3 past switch switchwin stealth stealthwin propup partuncont mixeduncont uncont dwin minweight maxweight (min) lagyearmin=yearlag (max) dvotemiss rvotemiss ovotemiss lagyearmax=yearlag bigthird dontuse redistproblem [fweight=weight], by(year sid sen dname dno geopost mmdpost)
*VOTEMISS
*Change the vote variables back to sysmis wherever any contributing case was missing.  
replace dvote=. if dvotemiss==1
replace rvote=. if rvotemiss==1
replace ovote=. if ovotemiss==1
drop dvotemiss rvotemiss ovotemiss
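*EXAMPLE (hypothetical toy data): a minimal sketch of why the votemiss flags are needed.  collapse skips missing values, so (sum) returns 0 when every value is missing and (mean) returns the nonmissing value when only some are missing.  The names id, v, vsum, vmean and vmiss are made up for this illustration.  
preserve
clear
input id v
1 .
1 .
2 5
2 .
end
gen vmiss=v==.
collapse (sum) vsum=v (mean) vmean=v (max) vmiss, by(id)
*id=1: all missing, so the sum is 0 and the mean is sysmis.  id=2: one missing, so both are 5.  
assert vsum==0&vmean==.&vmiss==1 if id==1
assert vsum==5&vmean==5&vmiss==1 if id==2
*The flag lets both groups be reset to sysmis, which is what the replace lines above do for the vote variables.  
replace vmean=. if vmiss==1
assert vmean==.
list
restore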
*RENAME
*Rename match vars
rename dname dnamemerge
rename dno dnomerge
rename geopost geopostmerge
*rename vars with "lag" as the suffix
foreach string in dper dvote rvote ovote cand inc inc2 inc3 other leg2 leg3 past switch switchwin stealth stealthwin propup bigthird dontuse partuncont mixeduncont uncont dwin minweight maxweight {
rename `string' `string'lag
}
drop c
save tempslerslagged, replace

*MAIN FILE
*MERGE
*Merge in lagged variables
*First, alter vars that will be merged on when redist=2, 4, 7 or 9.  
clear
use $mainfile
gen dnamemerge=dname
replace dnamemerge=redist1 if redist==2|redist==4|redist==7|redist==9
gen dnomerge=dno
replace dnomerge=redist2 if redist==2|redist==4|redist==7|redist==9
gen geopostmerge=geopost
replace geopostmerge=redist3 if redist==2|redist==4|redist==7|redist==9
*a many to 1 merge must be done because when redist=2, a district designation appears twice.  This won't hurt anything if everything is zeroed out that is redist=1.  The past value is being put into two districts, but since redist=1 for one, the inappropriately matched one will be changed to system missing.  
merge m:1 year sid sen dnamemerge dnomerge geopostmerge mmdpost using tempslerslagged
erase tempslerslagged.dta
drop if year>2016
tab year _merge
*For earlier years, the merge=2 cases cluster in redistricting years, as expected.
*For merge=1, those are also clustered in redistricting years, as expected.  
tab year _merge if sen==1|(sen==0&sid==34&year>1998)
tab year _merge if sen==0&(!(sid==34&year>1998))
tab sid _merge if sen==0&_merge!=3
tab year redist if _merge==1&sen==0&(!(sid==34&year>1998))
*I believe the merge=1 cases are orphans, the counter-parts of redist=2, 4, 7 and 9 cases.  Say district #4 became district #2.  Then after redistricting, contemporaneous district #4 would receive no lagged value (which doesn't matter, since it would be changed to system missing anyway).  This is consistent with the fact that only eight contests with redist=2 are merge=1 for state houses, out of 7470 merge=1 cases.  
*Merge=2 cases had to be kept for the tabs above; they can be dropped now.  
drop if _merge==2
drop dnamemerge dnomerge geopostmerge _merge

*YEARLAG
gen tempdif=lagyearmax!=lagyearmin
tab tempdif redistproblem
*redistproblem is always associated with tempdif=0, good.
assert tempdif==0 if redistproblem==1
tab redist if redistproblem==1
*redist=0 for 21 cases, redist=1 for 6 cases, redist=4 for 1 case, and redist=8 for 1 case.  Redist=0 is no problem.  The redist=1 cases are no problem, they will simply be changed to system missing like all the other lagged values in that circumstance.  redist=4 is a combo of redist=2 and redist=3, and I don't see any problem with that.  redist=8 isn't a problem, the earlier value of the pair has been dropped.  
*There are 6 cases of tempdif=1.
sort sid sen dname dno geopost mmdpost year
list sid sen dname dno geopost mmdpost year tempdif if tempdif>0&tempdif!=.
*None of those are problematic, they are different streams coming together as they should.  
drop tempdif redistproblem

*LAGS, REDIST & MISS
*Make vars sysmis if redistricting occurred.  Note that the year the lagged value is coming from has to be turned to system missing if there was redistricting.  Many of these will be filled in after the below.  
foreach string in dperlag candlag inclag inc2lag inc3lag otherlag leg2lag leg3lag pastlag switchlag switchwinlag stealthlag stealthwinlag partuncontlag mixeduncontlag uncontlag dwinlag propuplag bigthirdlag dontuselag lagyearmin lagyearmax {
replace `string'=. if redist==1|redist==6|redist==8
}

*Fill in lagyearmin and lagyearmax when possible.  
*If termz is always "4" in a chamber, then make lagyearmin/max always equal year-4.  Same for always "2."  
egen mintermz2=min(mintermz), by(sid sen)
egen maxtermz2=max(maxtermz), by(sid sen)
forvalues aaa=2(2)4 {
gen always`aaa'=(mintermz2==maxtermz2)&mintermz==`aaa'
foreach string in min max {
replace lagyear`string'=year-`aaa' if always`aaa'==1&lagyear`string'==.
}
drop always`aaa'
}
drop mintermz2 maxtermz2
tab sid if lagyearmin==.&sen==0
*some are left, I can see why.  etype=ssg elections.  
gen miss=lagyearmin==.&firstcase==0
foreach num in 1 9 17 18 24 46 {
di "`num'"
tab year miss if sen==0&sid==`num'
}
*The above implies the following will work.  
replace lagyearmin=1982 if sen==0&sid==1&year==1983
replace lagyearmin=1981 if sen==0&sid==17&year==1984
replace lagyearmin=1991 if sen==0&sid==24&year==1992
replace lagyearmin=1981 if sen==0&sid==46&year==1982
replace lagyearmin=year-4 if sen==0&(sid==1|sid==18|sid==24)&lagyearmin==.
replace lagyearmin=year-2 if sen==0&(sid==9|sid==17|sid==46)&lagyearmin==.
replace lagyearmax=lagyearmin if lagyearmax==.
tab sid if lagyearmin==.&sen==0
*only nd remains, and I can't do much with that.  
*SEN
tab sid if lagyearmin==.&sen==1
*Of the states in that list, the following are the only ones I can do something about.  
foreach num in 1 16 18 22 23 24 30 31 39 40 46 {
di "`num'"
tab year miss if sen==1&sid==`num'
}
*The above implies the following will work.  
replace lagyearmin=1982 if sen==1&sid==1&year==1983
replace lagyearmin=1991 if sen==1&sid==24&year==1992
replace lagyearmin=1980 if sen==1&sid==39&year==1983
*THESE2
gen these2=sen==1&lagyearmin==.&(sid==23|sid==30|sid==39)
*THESE4
gen these4=sen==1&lagyearmin==.&(sid==1|sid==18|sid==22|sid==24|sid==46)
*sid=16
replace these4=1 if sen==1&lagyearmin==.&sid==16&(year==1972|year==1992|year==2004)
*sid=31
replace these4=1 if sen==1&lagyearmin==.&sid==31&(year==1984|year==2004|year==2012)
*sid=40
replace these4=1 if sen==1&lagyearmin==.&sid==40&(year==1972|year==1984|year==2004|year==2012)
*FILLIN
replace lagyearmin=year-2 if these2==1
replace lagyearmin=year-4 if these4==1
replace lagyearmax=lagyearmin if lagyearmax==.
drop these4 these2
*Again
*look at sids again
tab sid if firstcase==0&lagyearmin==.&sen==1
*some of those state senates had all their seats up in years ending in two.  They probably aren't missing lagyearmin, though.  
drop miss
gen miss=lagyearmin==.&firstcase==0
foreach num in 2 4 6 8 9 11 13 14 15 16 17 25 26 28 31 34 35 36 37 38 40 42 43 44 47 48 49 50 {
di "`num'"
tab year miss if sen==1&sid==`num'
}
*None of those can be filled in.  
drop miss
save $mainfile, replace

*LAGSYSMIS
*Bring in the codes that indicate that lagged values that were changed to system missing have no variation in the variable in question in the entire chamber-year (or year group, if appropriate) among the possible lagged values.  
clear
use $mainfile
*are lagyearmin and lagyearmax ever dif when there is redistricting?  I think by virtue of how I filled those in, that wouldn't be possible.  
*gen lagyeardif=lagyearmin!=lagyearmax
*tab lagyeardif
*There are only 5 difs total, that is a surprise, I thought there were way more.  
*CHECK INTO THE ABOVE SOME OTHER TIME
assert redist==0 if lagyearmin!=lagyearmax
*Those are all redist=0, that makes things a lot easier.  
*LAGAID1
rename lagyearmin yearlagaid
merge m:1 sid sen yearlagaid using templagaid1
erase templagaid1.dta
drop if _merge==2
drop _merge
foreach string of global lagaidvars {
replace `string'lag=`string'mean if `string'lag==.&`string'novar==1
drop `string'mean `string'novar
}
rename yearlagaid lagyearmin
*LAGAID2
*Bring lag aid #2 ahead in time two years.  
gen yearlagaid=year-2
merge m:1 sid sen yearlagaid using templagaid2
drop if _merge==2
drop _merge
foreach string of global lagaidvars {
replace `string'lag=`string'mean if `string'lag==.&`string'novar==1
drop `string'mean `string'novar
}
*Bring lag aid #2 ahead in time four years.  
*Only allow sysmis to be replaced by a value if redist is some type of holdover redistricting (or no redistricting, but that possibility shouldn't be present).  If it is sysmis & not holdover redistricting, then it's an instance of re-redistricting (excepting cases in the beginning of the dataset).  
replace yearlagaid=year-4
merge m:1 sid sen yearlagaid using templagaid2
erase templagaid2.dta
drop if _merge==2
drop _merge
foreach string of global lagaidvars {
replace `string'lag=`string'mean if `string'lag==.&`string'novar==1&(redist==0|redist>5)
drop `string'mean `string'novar
}
*LAGAID3(WVSEN)
merge m:1 sid sen yearlagaid using templagaid3
erase templagaid3.dta
drop if _merge==2
drop _merge
foreach string of global lagaidvars {
replace `string'lag=`string'mean if `string'lag==.&`string'novar==1
drop `string'mean `string'novar
}
drop yearlagaid

*UNCONTLAG-INTERACTIONS
*Interactions between uncontlag and inclag, otherlag and pastlag.  
*The impact of past incumbency on change in vote share will be different depending on whether the last election was contested or not.  
gen incuncontlag=inclag*uncontlag
*Are there cases of an incumbent of one party being the only cand of their party running and being opposed by two cands of the other party?
tab inclag candlag if eseats==2
*there are 27 cases like that, unfortunately not 0, but it's not enough to justify modeling it.  Additionally, it will already be controlled for additively for both Xs (partially contested / incumbency).  
gen otheruncontlag=otherlag*uncontlag
gen pastuncontlag=pastlag*uncontlag
tab incuncontlag
tab otheruncontlag
tab pastuncontlag
*Those look good.  

*STATE
merge m:1 sid using 000_StateCodes
drop if _merge==2
drop _merge

*IDENTIFICATION
bysort year sid sen dname dno geopost mmdpost: assert _N==1
*Those identify the data, good.  

*ORGANIZE
sort year sid sen dname dno geopost mmdpost etype

order ///
year ///
state ///
sid ///
sfips ///
sab ///
sen ///
dname ///
dno ///
geopost ///
mmdpost ///
specpost ///
dtype ///
dseats ///
popnum ///
redist ///
redist1 ///
redist2 ///
redist3 ///
regime ///
nest ///
nest1 ///
nest2 ///
nest3 ///
etype ///
deter ///
eseats ///
firstcase ///
seatsup ///
totalseats ///
propup ///
dvote ///
rvote ///
ovote ///
dcand ///
rcand ///
ocand ///
dwin ///
rwin ///
owin ///
dinc ///
rinc ///
oinc ///
dinc2 ///
rinc2 ///
oinc2 ///
dinc3 ///
rinc3 ///
oinc3 ///
dother ///
rother ///
oother ///
dleg2 ///
rleg2 ///
oleg2 ///
dleg3 ///
rleg3 ///
oleg3 ///
dpast ///
rpast ///
opast ///
dswitch ///
rswitch ///
dswitchwin ///
rswitchwin ///
dstealth ///
rstealth ///
dstealthwin ///
rstealthwin ///
dontuse ///
votemiss ///
bigthird ///
mixeduncont ///
partuncont ///
uncont ///
dper ///
cand ///
inc ///
inc2 ///
inc3 ///
other ///
leg2 ///
leg3 ///
past ///
switch ///
switchwin ///
stealth ///
stealthwin ///
maxtermz ///
mintermz ///
maxweight ///
minweight ///
propuplag ///
dperlag ///
dwinlag ///
dvotelag ///
rvotelag ///
ovotelag ///
candlag ///
inclag ///
inc2lag ///
inc3lag ///
otherlag ///
leg2lag ///
leg3lag ///
pastlag ///
switchlag ///
switchwinlag ///
stealthlag ///
stealthwinlag ///
dontuselag ///
bigthirdlag ///
mixeduncontlag ///
partuncontlag ///
uncontlag ///
incuncontlag ///
otheruncontlag ///
pastuncontlag ///
lagyearmin ///
lagyearmax ///
minweightlag ///
maxweightlag

save $mainfile, replace




*SEATCHECK
*See if all seats in a chamber are accounted for.  
clear
use $mainfile
egen totalseats2=total(eseats), by(year sid sen)
gen dif=totalseats-totalseats2
*STATEHOUSES
tab sid dif if sen==0
*one missing election in fl and mn, probably those special elections that were one year after the general.  
*ND is also obviously not correct.  
tab year sid if dif==1&sen==0&sid!=34
*fl 2014
*mn 2016
list dno if sid==9&sen==0&year==2015
*#13 is in 2015.  
list dno geopost if sid==23&sen==0&year==2016&dno==32
*geopost=1 is there, but geopost=2 isn't.  
list dno if sid==23&sen==0&year==2017
*STATESENATES
tab sid dif if sen==1
*Looks good generally.  Of course, with staggered seats, many aren't all up, but the state senates without staggered terms always have the same number of seats up as total seats in the legislature.  
gen dif2=seatsup-totalseats2
tab dif2
*only .18% of cases are in a chamber that is off.  They are only off by one.  
tab sid sen if dif2!=0
*It's just the FL and MN houses.
tab sid year if dif2!=0
*2014, 2015 for FL, 2016 for MN, so they're definitely the ones from above.  
clear

*CHECK
*Check basic variables against an older version of SLERs restructured to the contest unit of analysis.  
clear
use 003_FromForecastFolder103mainfile20180826
keep year sid sen dno dname geopost mmdpost dontuse dper dperlag
foreach string in dontuse dper dperlag {
rename `string' `string'old
}
save temp, replace
clear
use $mainfile
merge 1:1 year sid sen dno dname geopost mmdpost using temp
*lots of merge failures
tab year _merge
*all the merge=2 cases are 2018 cases + one 2017 case
drop if _merge==2
tab sid if _merge==1
*Those are the one-party South (and merely one-party, in some cases) and odd-year states that were dropped from the forecasting analysis.  
drop if _merge==1
tab dontuse
tab dontuseold
tab dontuse dontuseold
*only about half of them agree.  
tab sid if dontuseold==0&dontuse==1
*I thought a lot of these would be NC, but that wasn't the case.  
tab sid if dontuseold==1&dontuse==0
*a lot of those are TN.  
tab sid if dontuse==1
*a lot of dontuse=1 cases are in NC, good.  
*Is it a difference in how bigthird was used?
tab dontuseold bigthird
*almost no common cases.
tab dontuse bigthird
*almost no common cases.
*So that isn't the cause of the discrepancy.  
*Why are there so many in NY and NH?
tab year if sid==32&dontuse==1
*a lot are in 1996.  
tab year if sid==29&dontuse==1
*spread out.  
tab year if dontuse==1
*spread out.  
*VOTESHARE
reg dper dperold
*perfect cor.
*LAGVOTESHARE
reg dperlag dperlagold
*perfect cor.
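*CHECK (hedged, non-fatal): a regression fit only shows a perfect linear relationship, not equality.  As a more direct comparison, count contests where the new and old values differ by more than a tolerance, or where only one of the two is missing.  The 1e-4 tolerance is an arbitrary assumption.  
count if abs(dper-dperold)>1e-4&dper<.&dperold<.
count if abs(dperlag-dperlagold)>1e-4&dperlag<.&dperlagold<.
count if (dper==.)!=(dperold==.)
count if (dperlag==.)!=(dperlagold==.)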
clear



*UOA=CHAMBERYEAR
*Create a file that tells the users what proportion of seats are dontuse=1, and what additional proportion of seats are bigthird=1 or 2.  
clear
use $mainfile
gen seatsdontuse=dontuse*eseats
gen bigthird1seats=eseats if bigthird==1&dontuse==0
gen bigthird2seats=eseats if bigthird==2&dontuse==0
recode bigthird1seats bigthird2seats (.=0)
rename eseats seatsup2
collapse (sum) seatsup2 seatsdontuse bigthird1seats bigthird2seats (mean) seatsup, by(year state sid sen)
order state sid sen year seatsup seatsup2 seatsdontuse bigthird1seats bigthird2seats
sort state sen year
export delimited 103uoachamyear${datezzz}.csv, replace
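*CHECK (hedged, non-fatal): the dontuse seats and the two bigthird counts are mutually exclusive, so, assuming dontuse is coded 0/1, together they should never exceed the seats up.  A nonzero count would point to a coding problem.  
count if seatsdontuse+bigthird1seats+bigthird2seats>seatsup2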

